Churn Image

Telecom Churn Analysis: How To Keep Your Customers "On the Line"


Authors: Jared Mitchell, Andrew Marinelli, Wes Newcomb

Overview


In this notebook, we analyze and build classification models with data from SyriaTel, a telecom company, in an effort to understand the relationships and patterns between several customer variables and customer churn. After cleaning and encoding the data, we take an iterative, comparative approach to model production, eventually converging on a robust classification model that can determine, with sufficient power, the likelihood that a given customer will churn.

Business Understanding


Churn has long been king among metrics for companies wishing to gauge the success of their product. Intuitively, customers wouldn't drop your service if they liked it, right? According to churn expert Patrick Campbell, "Your churn rate is a direct reflection of the value of the product and features that you're offering to customers." Further, when churn is combined with other features of your service, such as cost, we can determine the price at which the offered service becomes most profitable: we are willing to lose some customers to an increased cost of service as long as overall profit grows as a result. Thus, the question is born: can we predict churn on a client-by-client basis, so that we can shift from a reactive to a proactive approach to business decisions on items such as product feature implementations, customer service operations, retention campaigns, and pricing optimization? The short answer is yes; armed with a predictive model, SyriaTel can not only make its service better, it can also increase its profits.

Data Exploration


The SyriaTel dataset consists of 21 columns and 3333 rows, each representing a unique account holder. The dataset was complete and consistent across all rows when we received it. The time period the data covers is not specified.

In [1]:
In [2]:
Out[2]:
state
account length
area code
phone number
international plan
voice mail plan
number vmail messages
total day minutes
total day calls
total day charge
...
total eve calls
total eve charge
total night minutes
total night calls
total night charge
total intl minutes
total intl calls
total intl charge
customer service calls
churn
0 KS 128 415 382-4657 no yes 25 265.1 110 45.07 ... 99 16.78 244.7 91 11.01 10.0 3 2.70 1 False
1 OH 107 415 371-7191 no yes 26 161.6 123 27.47 ... 103 16.62 254.4 103 11.45 13.7 3 3.70 1 False
2 NJ 137 415 358-1921 no no 0 243.4 114 41.38 ... 110 10.30 162.6 104 7.32 12.2 5 3.29 0 False
3 OH 84 408 375-9999 yes no 0 299.4 71 50.90 ... 88 5.26 196.9 89 8.86 6.6 7 1.78 2 False
4 OK 75 415 330-6626 yes no 0 166.7 113 28.34 ... 122 12.61 186.9 121 8.41 10.1 3 2.73 3 False

5 rows × 21 columns

The information provided per client includes how long they've been with SyriaTel in months (account length); which plans they are signed up for (international plan, voice mail plan); usage metrics (total day minutes, total night charge); the number of calls they made to customer support; and of course, churn status.

In [3]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   phone number            3333 non-null   object 
 4   international plan      3333 non-null   object 
 5   voice mail plan         3333 non-null   object 
 6   number vmail messages   3333 non-null   int64  
 7   total day minutes       3333 non-null   float64
 8   total day calls         3333 non-null   int64  
 9   total day charge        3333 non-null   float64
 10  total eve minutes       3333 non-null   float64
 11  total eve calls         3333 non-null   int64  
 12  total eve charge        3333 non-null   float64
 13  total night minutes     3333 non-null   float64
 14  total night calls       3333 non-null   int64  
 15  total night charge      3333 non-null   float64
 16  total intl minutes      3333 non-null   float64
 17  total intl calls        3333 non-null   int64  
 18  total intl charge       3333 non-null   float64
 19  customer service calls  3333 non-null   int64  
 20  churn                   3333 non-null   bool   
dtypes: bool(1), float64(8), int64(8), object(4)
memory usage: 524.2+ KB
In [4]:
Out[4]:
account length
area code
number vmail messages
total day minutes
total day calls
total day charge
total eve minutes
total eve calls
total eve charge
total night minutes
total night calls
total night charge
total intl minutes
total intl calls
total intl charge
customer service calls
count 3333.000000 3333.000000 3333.000000 3333.000000 3333.000000 3333.000000 3333.000000 3333.000000 3333.000000 3333.000000 3333.000000 3333.000000 3333.000000 3333.000000 3333.000000 3333.000000
mean 101.064806 437.182418 8.099010 179.775098 100.435644 30.562307 200.980348 100.114311 17.083540 200.872037 100.107711 9.039325 10.237294 4.479448 2.764581 1.562856
std 39.822106 42.371290 13.688365 54.467389 20.069084 9.259435 50.713844 19.922625 4.310668 50.573847 19.568609 2.275873 2.791840 2.461214 0.753773 1.315491
min 1.000000 408.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 23.200000 33.000000 1.040000 0.000000 0.000000 0.000000 0.000000
25% 74.000000 408.000000 0.000000 143.700000 87.000000 24.430000 166.600000 87.000000 14.160000 167.000000 87.000000 7.520000 8.500000 3.000000 2.300000 1.000000
50% 101.000000 415.000000 0.000000 179.400000 101.000000 30.500000 201.400000 100.000000 17.120000 201.200000 100.000000 9.050000 10.300000 4.000000 2.780000 1.000000
75% 127.000000 510.000000 20.000000 216.400000 114.000000 36.790000 235.300000 114.000000 20.000000 235.300000 113.000000 10.590000 12.100000 6.000000 3.270000 2.000000
max 243.000000 510.000000 51.000000 350.800000 165.000000 59.640000 363.700000 170.000000 30.910000 395.000000 175.000000 17.770000 20.000000 20.000000 5.400000 9.000000
In [5]:
Out[5]:
False    2850
True      483
Name: churn, dtype: int64

We are clearly dealing with an imbalanced dataset (483 churners out of 3333 customers, roughly 14.5%), which means we will have to be careful to apply proper weights to our outcome groups. However, the imbalance is not so extreme as to require more aggressive corrective measures such as resampling.

Data Preparation


Some of the columns were immediately identifiable as irrelevant to predicting churn, such as phone number and area code, so we dropped them from the dataset outright. One might think that area code is connected to customer region; however, many people (perhaps a majority) keep their numbers when they move, so someone with a San Francisco phone number could very well be living in South Dakota. We can safely conclude that area code does not contain robust customer information, and because phone numbers are semi-randomly assigned, the same can be said for that column.
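Dropping those identifier columns is a one-liner in pandas. The sketch below uses a tiny hypothetical frame whose column names match the dataset but whose values are made up:

```python
import pandas as pd

# Tiny stand-in for the SyriaTel frame (values are illustrative only)
df = pd.DataFrame({
    "phone number": ["382-4657", "371-7191"],
    "area code": [415, 408],
    "churn": [False, True],
})

# Neither column carries robust customer information, so drop both outright
df = df.drop(columns=["phone number", "area code"])
print(df.columns.tolist())  # ['churn']
```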

In [6]:

Some of the columns needed to be reformatted from yes/no to a binary 1/0 encoding.
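A minimal sketch of that encoding, assuming a simple `map` over the two plan columns (the exact approach in the hidden cell may differ):

```python
import pandas as pd

df = pd.DataFrame({
    "international plan": ["no", "yes", "no"],
    "voice mail plan": ["yes", "no", "no"],
})

# Map the yes/no flags to 1/0 so downstream models can consume them
for col in ["international plan", "voice mail plan"]:
    df[col] = df[col].map({"yes": 1, "no": 0})

print(df["international plan"].tolist())  # [0, 1, 0]
```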

In [7]:

Further, the success of some telecom companies is region-specific: perhaps they offer great coverage in some regions but terrible coverage in others. This is not the case for SyriaTel, however, whose customer counts and churn rates show no characteristic regional pattern. We can see this from the following visual representations.

In [8]:
In [9]:
[Bar chart: "State Rankings By Customer Count" (y-axis: Number of Clients, roughly 40 to 100 per state)]
In [10]:

Churn rate does vary from state to state, which we can see in the bar chart below.

In [11]:

While some states do have noticeably higher churn rates than others, we cannot take this as characteristic of those states' customer populations, because the per-state samples are small: if we only have data on 30 customers from California, we cannot be confident that statistics on those 30 customers represent all customers in California. Finally, we can say with relative certainty that a state's representation in the dataset does not characterize its churn rate: states with larger representation are, generally speaking, no more likely to show higher or lower churn rates than states with smaller representation.

In [12]:
Out[12]:
value_count
churn
value_count 1.000000 -0.001216
churn -0.001216 1.000000

As a result, we can drop the state column from the dataset.

In [13]:

We do have some columns that are nearly perfectly collinear as well:

In [14]:

Since minutes used per time period is a more direct metric than the corresponding charge, and since the number of voicemail messages and the total international charge track directly with whether a customer has the voice mail or international plan, respectively, we can safely drop those columns from the dataset.
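The collinearity is easy to verify directly: each charge column is just the corresponding minutes column times a flat per-minute rate. A quick sketch using the day-time values from the first three rows shown above:

```python
import pandas as pd

# Day minutes/charges from the first three rows of the dataset
df = pd.DataFrame({
    "total day minutes": [265.1, 161.6, 243.4],
    "total day charge": [45.07, 27.47, 41.38],
})

# Charge is minutes times a flat rate, so the pair is near-perfectly correlated
corr = df["total day minutes"].corr(df["total day charge"])
print(round(corr, 4))  # ~1.0
```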

In [15]:

Train-test split, et voilà! The dataset is clean and ready for modeling.
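The test-set support of 834 out of 3333 rows implies a 25% hold-out. A minimal, stratified sketch on synthetic data (the random seed and exact arguments are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 20 rows, 25% positive class
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)

# Stratifying on churn keeps the class ratio consistent across both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(len(X_train), len(X_test))  # 15 5
```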

In [16]:

Modeling


We took an iterative approach to modeling the data: starting with a baseline, progressing through the simplest models, and finally exploring more advanced models.

We determined that a customized F-style score is the most appropriate metric for measuring the success of our model because (1) we are dealing with imbalanced data, and thus a skewed, high baseline accuracy; and (2) we are interested in a healthy medium between identifying customers who are going to churn and misidentifying customers who are not going to churn. Put simply: each customer we save from churning yields a large increase in revenue, whereas each customer we misidentify as churning yields only a small decrease in revenue. We therefore decided to optimize our models against the F4-score, which weights recall four times as heavily as precision, because the cost of losing a customer is far greater than the cost of accidentally being overly generous with a customer we were not going to lose in the first place.
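scikit-learn's `fbeta_score` implements exactly this family of metrics, with `beta=4` tilting the score toward recall. A worked sketch on toy labels shows the effect: a model with perfect recall but mediocre precision scores far better under F4 than under F1.

```python
from sklearn.metrics import f1_score, fbeta_score

# Toy predictions: perfect recall (no churner missed) but only 50% precision
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0]

# F-beta = (1 + b^2) * P * R / (b^2 * P + R); beta=4 heavily favors recall
f1 = f1_score(y_true, y_pred)             # punished for low precision
f4 = fbeta_score(y_true, y_pred, beta=4)  # rewarded for full recall
print(round(f1, 3), round(f4, 3))  # 0.667 0.944
```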

In [17]:
In [18]:

Baseline Model

Logically, we should start with the current state of the company. SyriaTel does not currently have any way of predicting customer churn; in other words, it treats every customer as though they were not going to churn. Our baseline model reflects this stratagem.
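scikit-learn's `DummyClassifier` can stand in for this "predict no churn for everyone" baseline; a sketch on a small synthetic sample:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# 80/20 class mix, echoing the dataset's imbalance
X = np.zeros((10, 1))
y = np.array([0] * 8 + [1] * 2)

# Always predict the majority class, i.e. "this customer will not churn"
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
preds = baseline.predict(X)
print(int(preds.sum()))  # 0: nobody is ever flagged as churning
```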

In [19]:
Out[19]:
array([0., 0., 0., 0., 0.])
In [20]:
              precision    recall  f1-score   support

           0       0.85      1.00      0.92       713
           1       0.00      0.00      0.00       121

    accuracy                           0.85       834
   macro avg       0.43      0.50      0.46       834
weighted avg       0.73      0.85      0.79       834

Clearly, we are starting from zero here: our F4-score is 0, even though we have 85% accuracy.

Logistic Regression

We decided that logistic regression would make a good starting point, as it is relatively simple to understand and easy to implement. Our approach iteratively added features in order of importance, judged by each feature's relative ability to predict the target variable. Additionally, we tuned the hyperparameters over a sufficiently large parameter space to call this the best logistic regression model obtainable, given certain time and complexity constraints. We also experimented with different data preprocessing techniques.
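A sketch of the kind of pipeline involved, on synthetic data; `class_weight="balanced"` is one standard way to apply the outcome-group weights discussed earlier, though the hyperparameters our search actually settled on are not reproduced here:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data: one informative feature, three noise features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0.8).astype(int)

# Scale features, then re-weight the minority (churn) class in the loss
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(class_weight="balanced"))
clf.fit(X, y)
print(clf.score(X, y) > 0.5)  # comfortably beats coin-flipping
```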

In [21]:
In [22]:
In [23]:
Train Statistics
              precision    recall  f1-score   support

           0       0.95      0.80      0.87      2137
           1       0.39      0.76      0.52       362

    accuracy                           0.79      2499
   macro avg       0.67      0.78      0.69      2499
weighted avg       0.87      0.79      0.82      2499


Test Statistics
              precision    recall  f1-score   support

           0       0.94      0.78      0.85       713
           1       0.35      0.70      0.47       121

    accuracy                           0.77       834
   macro avg       0.65      0.74      0.66       834
weighted avg       0.85      0.77      0.80       834

This model shows a significant improvement over baseline, with an F1-score of 0.47 for the churn class, while sacrificing only 8 points of accuracy. Additionally, it does not appear that we are overfitting.

k-Nearest Neighbors

We also took an iterative approach to creating our kNN model, incorporating features with respect to their correlation with the target variable, churn; however, we do not contend that the feature combination we found to produce the best model is necessarily the best possible combination. Feature selection for kNN can be particularly challenging, and even the most modern methods are not comprehensive.
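The fitted model printed below uses `n_neighbors=3, p=1, weights='distance'`. A sketch of that configuration on synthetic data illustrates a quirk worth knowing: distance weighting drives training accuracy to 1.0 (each point is its own zero-distance neighbor), which is exactly the overfitting symptom discussed afterward.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 8))
y = (X[:, 0] > 0).astype(int)

# Scaling is essential for kNN: distances are meaningless across unscaled units
Xs = StandardScaler().fit_transform(X)
knn = KNeighborsClassifier(n_neighbors=3, p=1, weights="distance").fit(Xs, y)
print(knn.score(Xs, y))  # 1.0 on training data
```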

In [24]:
In [25]:
In [26]:
In [27]:

We can see from the graphic below that we achieve our maximum F1-score at 8 features, while maintaining healthy accuracy, recall, and precision levels.

In [28]:
In [29]:
Out[29]:
KNeighborsClassifier(n_neighbors=3, p=1, weights='distance')
In [30]:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2137
           1       1.00      1.00      1.00       362

    accuracy                           1.00      2499
   macro avg       1.00      1.00      1.00      2499
weighted avg       1.00      1.00      1.00      2499

              precision    recall  f1-score   support

           0       0.93      0.95      0.94       713
           1       0.66      0.57      0.61       121

    accuracy                           0.89       834
   macro avg       0.79      0.76      0.77       834
weighted avg       0.89      0.89      0.89       834

We see a significant improvement in both overall accuracy and F1-score with respect to the logistic regression model. Not bad! However, we may be overfitting, as the dramatic differences between train and test scores across various metrics suggest. To resolve this, we can remove some of the input features to reduce model complexity.

Naive Bayes

Naive Bayes offers another simple approach to predicting churn. We chose not to incorporate feature selection here because Naive Bayes usually benefits from more features, provided the feature count is not excessively large (on the order of hundreds).
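A minimal Gaussian Naive Bayes sketch on synthetic data; `GaussianNB` assumes each feature is conditionally independent and normally distributed within each class, and needs no scaling or tuning:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Fit class-conditional Gaussians per feature, then score via Bayes' rule
nb = GaussianNB().fit(X, y)
probs = nb.predict_proba(X)
print(probs.shape)  # (200, 2): per-class probabilities for each customer
```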

In [31]:
In [32]:
Train
              precision    recall  f1-score   support

           0       0.90      0.93      0.92      2137
           1       0.50      0.43      0.46       362

    accuracy                           0.85      2499
   macro avg       0.70      0.68      0.69      2499
weighted avg       0.85      0.85      0.85      2499


Test
              precision    recall  f1-score   support

           0       0.90      0.91      0.91       713
           1       0.44      0.41      0.43       121

    accuracy                           0.84       834
   macro avg       0.67      0.66      0.67       834
weighted avg       0.83      0.84      0.84       834

This may not be our best-performing model, but its results still beat our baseline. Additionally, we see no evidence of overfitting, as the test evaluation metrics closely mirror the train evaluation metrics.

Simple Decision Tree

Our final simple model, the decision tree, is powerful because of its bare-bones preprocessing requirements. Decision trees perform feature selection implicitly through their splits, so rather than eliminating features we took a grid search approach to determining the best hyperparameters.
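A sketch of such a grid search, with an illustrative (assumed) parameter grid; the custom F4 metric plugs in via `make_scorer(fbeta_score, beta=4)`:

```python
import numpy as np
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

# Search depth and leaf size, scoring each candidate by F4 rather than accuracy
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, None], "min_samples_leaf": [1, 5, 10]},
    scoring=make_scorer(fbeta_score, beta=4),
    cv=3,
)
grid.fit(X, y)
print(sorted(grid.best_params_))  # ['max_depth', 'min_samples_leaf']
```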

In [33]:
In [34]:
Train
              precision    recall  f1-score   support

           0       0.97      0.83      0.90      2137
           1       0.47      0.87      0.61       362

    accuracy                           0.84      2499
   macro avg       0.72      0.85      0.75      2499
weighted avg       0.90      0.84      0.86      2499


Test
              precision    recall  f1-score   support

           0       0.95      0.79      0.86       713
           1       0.38      0.77      0.51       121

    accuracy                           0.79       834
   macro avg       0.67      0.78      0.69       834
weighted avg       0.87      0.79      0.81       834

Once again, we are beating our baseline; but there is a good chance we are overfitting, as evidenced by the divergence between train and test results.

Complex Models


In this section, we explore the results of more complex models, which are generally capable of better predictions, at the cost of higher computational complexity. We did not choose these models for any single overriding reason, though our motivation for XGBoost comes from its fame within the industry.

Random Forest

We expect that the random forest will perform well - and probably better than any of our simple models.
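A sketch of a random forest with a few regularizing constraints (depth and leaf-size caps, class weighting); the specific values here are assumptions for illustration, not the ones our tuning actually settled on:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 6))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Capping depth and requiring larger leaves are the usual levers
# for reining in a forest's tendency to overfit
rf = RandomForestClassifier(
    n_estimators=100, max_depth=6, min_samples_leaf=5,
    class_weight="balanced", random_state=0,
).fit(X, y)
print(rf.score(X, y) > 0.8)  # strong fit on the training data
```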

In [35]:
In [36]:
Train
              precision    recall  f1-score   support

           0       0.98      0.93      0.95      2137
           1       0.67      0.87      0.76       362

    accuracy                           0.92      2499
   macro avg       0.83      0.90      0.86      2499
weighted avg       0.93      0.92      0.92      2499


Test
              precision    recall  f1-score   support

           0       0.96      0.90      0.93       713
           1       0.56      0.79      0.66       121

    accuracy                           0.88       834
   macro avg       0.76      0.84      0.79       834
weighted avg       0.90      0.88      0.89       834

It looks as though our random forest has given us our best results thus far. However, it may be overfitting somewhat, so we need to tweak a few parameters to ensure that the model does not overfit the training data.

XGBoost

And finally, the holy grail of machine learning classification models: XGBoost. Because of its track record as a superstar amongst classification algorithms, we expect XGBoost to top all of our models thus far by a decent margin.

In [37]:
In [85]:
In [86]:
Train
              precision    recall  f1-score   support

           0       0.97      1.00      0.99      2137
           1       1.00      0.84      0.91       362

    accuracy                           0.98      2499
   macro avg       0.99      0.92      0.95      2499
weighted avg       0.98      0.98      0.98      2499


Test
              precision    recall  f1-score   support

           0       0.94      0.99      0.96       713
           1       0.90      0.60      0.72       121

    accuracy                           0.93       834
   macro avg       0.92      0.80      0.84       834
weighted avg       0.93      0.93      0.93       834

Comparing our evaluation metrics, we can see that the model is overfitting. Let's mitigate the issue by tuning some of our hyperparameters!

In [87]:
Out[87]:
Pipeline(steps=[('scaler', StandardScaler()),
                ('xgb',
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=0.5, gamma=0, gpu_id=-1,
                               importance_type='gain',
                               interaction_constraints='', learning_rate=0.2,
                               max_delta_step=0, max_depth=5,
                               min_child_weight=1, missing=nan,
                               monotone_constraints='()', n_estimators=100,
                               n_jobs=0, num_parallel_tree=1, random_state=0,
                               reg_alpha=0, reg_lambda=1, scale_pos_weight=0.4,
                               subsample=0.5, tree_method='exact',
                               validate_parameters=1, verbosity=None))])
In [88]:
Out[88]:
Pipeline(steps=[('scaler', StandardScaler()),
                ('xgb',
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, gamma=2.6, gpu_id=-1,
                               importance_type='gain',
                               interaction_constraints='', learning_rate=0.2,
                               max_delta_step=0, max_depth=5,
                               min_child_weight=2, missing=nan,
                               monotone_constraints='()', n_estimators=25,
                               n_jobs=0, num_parallel_tree=1, random_state=0,
                               reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
                               subsample=0.7, tree_method='exact',
                               validate_parameters=1, verbosity=None))])
In [89]:
Train
              precision    recall  f1-score   support

           0       0.97      1.00      0.98      2137
           1       0.98      0.83      0.90       362

    accuracy                           0.97      2499
   macro avg       0.97      0.92      0.94      2499
weighted avg       0.97      0.97      0.97      2499


Test
              precision    recall  f1-score   support

           0       0.96      0.99      0.97       713
           1       0.91      0.74      0.82       121

    accuracy                           0.95       834
   macro avg       0.93      0.87      0.90       834
weighted avg       0.95      0.95      0.95       834

Evaluation


We evaluate our models based on their ability to maximize the number of correctly identified churning customers while minimizing the number of misidentified non-churning customers. Again, the F4-score comes in handy because it quadruple-weights recall, incentivizing the model to prioritize identifying churning customers over hedging against misidentifying non-churning ones. This is because mitigating a single unit of churn returns more revenue than a single misidentified unit of non-churn costs; we estimate by around a factor of four.

Additional considerations include model complexity/interpretability and predictive time consumption.

In [92]:
In [93]:

Based solely on our metrics, we would expect XGBoost to do the best job in the field. However, Random Forest and kNN also perform well. We eliminate kNN because of its lack of interpretability. Between Random Forest and XGBoost, the better model would be decided based on retention campaign metrics, so we cannot say for certain which would be better for SyriaTel until we know more about their client base. Either will perform well enough to give SyriaTel an edge over its competition and lead to an overall increase in revenue.

Conclusion


There were several avenues we did not investigate. For example, we did not explore every combination of variables per model, and we ruled out some possibilities early on, out of what one might term pragmatism, that could have yielded significant results. We characterize our approach as the one most likely to yield the best results given our time constraints. We also had several cases of potential overfitting that we may not have addressed sufficiently. With that said, we believe we have performed as exhaustive a search as we could have.

That said, we believe that our final model will be able to boost overall revenues for SyriaTel by a company-changing amount. With our model, the future is brighter than it looked yesterday.

Future Research


We absolutely must pursue several leads in order to fuel optimal model development:

  • Find more robust churn predictors, such as dropped calls per customer or internet download speed per region. If we know why people are churning, we can do a better job addressing churn.
  • Analyze promotional success. We need to know the types and success rates of various promotions so that we can better calibrate our model for overall revenue.
  • Conduct market research. We need to know why customers are churning and which companies they move to afterward. That way, we know where we can improve.
  • Survey customers. We need to know, on a case-by-case level, how customers are using our service. That way, we know what to offer each customer if they show warning signs of churn.

Overall, there is still much work to be done. We look forward to working with SyriaTel through these challenges in order to maximize overall revenue.